Summary

I used the Weight Lifting Exercise (WLE) dataset. I split the ‘training’ dataset 60%/40% and trained a random forest model on the 60% portion, using 5-fold cross-validation. Cross-validation estimated a 1.2% out of sample error. The model then predicted classe in the held-out 40% of the training data with 99.1% accuracy, meaning the actual out of sample error was about 0.9%. This model also successfully predicted all 20 of the graded testing problems.

Data analysis, tidying & feature selection

setwd("~/Documents/Reference/MOOC/Data Science/code/machlearn")
library(caret)  # createDataPartition, findCorrelation, nearZeroVar, train
data <- read.csv("pml-training.csv", header=TRUE, sep=',', na.strings=c("NA","#DIV/0!",""))
finaltestdata <- read.csv("pml-testing.csv", header=TRUE, sep=',', na.strings=c("NA","#DIV/0!",""))

# make reproducible
set.seed(345)

# set up local training and testing sets; train models on 60% of cases and test
# against the remaining 40% to estimate out of sample error
inTrain = createDataPartition(data$classe, p = 0.6)[[1]]
pml.training = data[inTrain,]
pml.testing = data[-inTrain,]

The initial dataset had 160 variables, so the first order of business was to reduce the columns to those with the best predictive power. Examining the data showed several columns with a preponderance of ‘NA’ values (or #DIV/0! errors from Excel). I also eliminated columns which were not sensor-based and therefore unlikely to be predictive (notably the first seven columns, with row number, subject name, time stamp and time window information). Finally, from the remaining columns I identified and eliminated those which were highly correlated with one another.

# eliminate columns that are mostly (>50%) NA
pml.training <- pml.training[,colSums(is.na(pml.training)) < nrow(pml.training) * 0.5]
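As a quick sanity check, the same filtering rule applied to a toy data frame (hypothetical column names) keeps only the mostly-complete column:

```r
# toy data frame: 'good' is complete, 'mostly_na' is 80% NA
toy <- data.frame(good = 1:5,
                  mostly_na = c(NA, NA, NA, NA, 1))

# keep columns that are less than 50% NA -- the same rule as above
kept <- toy[, colSums(is.na(toy)) < nrow(toy) * 0.5, drop = FALSE]
colnames(kept)  # "good"
```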

# eliminate non-sensor columns (the first seven)
pml.training <- pml.training[,8:ncol(pml.training)]

# remove highly correlated variables
# calculate correlation matrix using every variable except the last one (which is classe, our outcome)
correlationMatrix <- cor(pml.training[,-ncol(pml.training)])

# eliminate attributes that are highly correlated (cutoff 0.75)
pml.training <- pml.training[,-(findCorrelation(correlationMatrix, cutoff=0.75))]
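To make the findCorrelation step concrete, here is a small sketch (toy data, hypothetical names) showing how it flags one member of each highly correlated pair:

```r
library(caret)

set.seed(1)
x <- rnorm(100)
toy <- cbind(a = x,                        # 'a' and 'b' are nearly collinear
             b = x + rnorm(100, sd = 0.1),
             c = rnorm(100))               # 'c' is unrelated

# findCorrelation returns the column index/indices recommended for removal
drop_idx <- findCorrelation(cor(toy), cutoff = 0.75)
colnames(toy)[drop_idx]  # one of "a" or "b"
```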

# zero- or near-zero variance?
nearZeroVar(pml.training, saveMetrics= TRUE)
##                      freqRatio percentUnique zeroVar   nzv
## yaw_belt              1.050336   14.35122283   FALSE FALSE
## gyros_belt_x          1.076155    1.08695652   FALSE FALSE
## gyros_belt_y          1.158576    0.56046196   FALSE FALSE
## gyros_belt_z          1.105014    1.39266304   FALSE FALSE
## magnet_belt_x         1.033493    2.53057065   FALSE FALSE
## magnet_belt_y         1.047739    2.30978261   FALSE FALSE
## roll_arm             52.717949   19.81148098   FALSE FALSE
## pitch_arm            70.931034   22.52038043   FALSE FALSE
## yaw_arm              33.704918   21.51834239   FALSE FALSE
## total_accel_arm       1.089354    0.55197011   FALSE FALSE
## gyros_arm_y           1.378125    3.04008152   FALSE FALSE
## gyros_arm_z           1.098726    1.97010870   FALSE FALSE
## magnet_arm_x          1.000000   11.10733696   FALSE FALSE
## magnet_arm_z          1.029851   10.59782609   FALSE FALSE
## roll_dumbbell         1.037975   87.60190217   FALSE FALSE
## pitch_dumbbell        2.632911   85.35156250   FALSE FALSE
## yaw_dumbbell          1.053333   86.98199728   FALSE FALSE
## total_accel_dumbbell  1.102217    0.34816576   FALSE FALSE
## gyros_dumbbell_x      1.067751    1.98709239   FALSE FALSE
## gyros_dumbbell_y      1.312139    2.20788043   FALSE FALSE
## gyros_dumbbell_z      1.061662    1.60495924   FALSE FALSE
## magnet_dumbbell_z     1.141509    5.58763587   FALSE FALSE
## roll_forearm         10.653153   14.75883152   FALSE FALSE
## pitch_forearm        63.891892   21.26358696   FALSE FALSE
## yaw_forearm          14.677019   13.90964674   FALSE FALSE
## total_accel_forearm   1.081925    0.56895380   FALSE FALSE
## gyros_forearm_x       1.048544    2.36922554   FALSE FALSE
## gyros_forearm_y       1.120370    6.03770380   FALSE FALSE
## gyros_forearm_z       1.104530    2.45414402   FALSE FALSE
## accel_forearm_x       1.068966    6.55570652   FALSE FALSE
## accel_forearm_z       1.010753    4.64504076   FALSE FALSE
## magnet_forearm_x      1.083333   12.00747283   FALSE FALSE
## magnet_forearm_y      1.196078   15.19191576   FALSE FALSE
## magnet_forearm_z      1.000000   13.42561141   FALSE FALSE
## classe                1.469065    0.04245924   FALSE FALSE

The resulting dataset is much smaller.

dim(pml.training)
## [1] 11776    35

Model construction

I used a random forest model with 5-fold cross-validation. Cross-validation estimates the out of sample error to be 1.2% (that is, the accuracy to be 98.8%).

model <- train(classe ~ ., data=pml.training, method="rf", trControl=trainControl(method="cv",number=5),
                allowParallel=TRUE, tuneGrid=data.frame(mtry=10))
print(model)
## Random Forest 
## 
## 11776 samples
##    34 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9421, 9421, 9420, 9422, 9420 
## Resampling results
## 
##   Accuracy   Kappa      Accuracy SD  Kappa SD   
##   0.9878569  0.9846376  0.001712162  0.002166542
## 
## Tuning parameter 'mtry' was held constant at a value of 10
## 
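The 1.2% error estimate quoted above is simply one minus the cross-validated accuracy reported in the resampling summary:

```r
# out of sample error estimated from the 5-fold CV accuracy above
cv_accuracy <- 0.9878569
oos_error_estimate <- 1 - cv_accuracy
round(oos_error_estimate, 3)  # 0.012, i.e. about 1.2%
```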

I also tested this model against the held-out 40% of the original training set, to see how well it predicted those cases.

predicted <- predict(model,newdata=pml.testing)
table(pml.testing$classe,predicted)
##    predicted
##        A    B    C    D    E
##   A 2228    1    0    1    2
##   B   14 1500    4    0    0
##   C    0   11 1344   13    0
##   D    0    0   13 1272    1
##   E    0    0    4    3 1435
prop.table(table(pml.testing$classe,predicted),1)
##    predicted
##                A            B            C            D            E
##   A 0.9982078853 0.0004480287 0.0000000000 0.0004480287 0.0008960573
##   B 0.0092226614 0.9881422925 0.0026350461 0.0000000000 0.0000000000
##   C 0.0000000000 0.0080409357 0.9824561404 0.0095029240 0.0000000000
##   D 0.0000000000 0.0000000000 0.0101088647 0.9891135303 0.0007776050
##   E 0.0000000000 0.0000000000 0.0027739251 0.0020804438 0.9951456311
prop.table(table(pml.testing$classe,predicted),2)
##    predicted
##                A            B            C            D            E
##   A 0.9937555754 0.0006613757 0.0000000000 0.0007757952 0.0013908206
##   B 0.0062444246 0.9920634921 0.0029304029 0.0000000000 0.0000000000
##   C 0.0000000000 0.0072751323 0.9846153846 0.0100853375 0.0000000000
##   D 0.0000000000 0.0000000000 0.0095238095 0.9868114818 0.0006954103
##   E 0.0000000000 0.0000000000 0.0029304029 0.0023273856 0.9979137691
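Recomputing the overall held-out accuracy directly from the confusion matrix printed above:

```r
# rows are the true classe, columns the predicted classe (values from the table above)
confusion <- matrix(c(2228,    1,    0,    1,    2,
                        14, 1500,    4,    0,    0,
                         0,   11, 1344,   13,    0,
                         0,    0,   13, 1272,    1,
                         0,    0,    4,    3, 1435),
                    nrow = 5, byrow = TRUE,
                    dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# overall accuracy = correctly classified cases / all cases
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 4)  # 0.9915
```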

Since the held-out accuracy is about 99.1% - even better than the cross-validation estimate - this is a good result. And indeed all 20 answers check out. Success!
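For the record, the 20 graded test cases were scored with a call of this shape (a sketch: it assumes finaltestdata, read in earlier, retains the sensor columns used in training, which predict() selects by name):

```r
# sketch: score the 20 graded test cases with the fitted model
answers <- predict(model, newdata = finaltestdata)
answers
```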

Appendix: Data exploration

Just a quick multiplot of the 34 predictor columns used, to see whether any are obviously worth exploring further or disregarding. (Yes, some level of data visualization overload, but I did put it at the end, and you can consider this a lattice / facet grid.)

library(ggplot2)
library(grid)
library(gridExtra)
plots=list()
predictors = colnames(pml.training)[-ncol(pml.training)]
for (pred in predictors) {
  p <-  ggplot(pml.training, aes_string(x='classe',y=pred)) + geom_boxplot() + geom_jitter(color="#56B4E9", alpha=0.05)
  plots[[pred]] <- p
}
do.call("grid.arrange", c(plots, ncol=3))